Genre Document Classification Using Flexible Length Phrases

Authors

  • Danijel Radošević
  • Jasminka Dobša
  • Dunja Mladenić
  • Zlatko Stapić
  • Miroslav Novak
Abstract

In this paper we investigate the possibility of using phrases of flexible length in genre classification of textual documents, as an extension of the classic bag-of-words document representation, in which documents are represented using single words as features. The investigation is conducted on a collection of articles from a document database collected from three different sources representing different genres: newspaper reports, abstracts of scientific articles, and legal documents. The investigation compares the classification results obtained with the classic bag-of-words representation against those obtained with bag of words extended by flexible-length phrases.
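The extended representation described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the flexible-length phrases are generated, so this example makes an assumption: a phrase is any contiguous word sequence of two to `max_len` words that occurs in at least `min_docs` documents. The function names, parameters, and sample sentences are all hypothetical and are not taken from the paper.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n):
    """Return all contiguous word sequences of length n as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequent_phrases(docs, max_len=3, min_docs=2):
    """Collect phrases of 2..max_len words found in at least min_docs documents.

    Assumption: document frequency is the selection criterion; the paper
    may select phrases differently.
    """
    doc_freq = Counter()
    for text in docs:
        tokens = tokenize(text)
        seen = set()
        for n in range(2, max_len + 1):
            seen.update(ngrams(tokens, n))
        doc_freq.update(seen)  # each phrase counted once per document
    return {phrase for phrase, df in doc_freq.items() if df >= min_docs}


def extended_bag_of_words(text, phrases, max_len=3):
    """Count single words, then add counts for the selected phrases."""
    tokens = tokenize(text)
    features = Counter(tokens)  # classic bag of words
    for n in range(2, max_len + 1):
        for gram in ngrams(tokens, n):
            if gram in phrases:
                features[gram] += 1  # phrase feature added to the word features
    return features


# Hypothetical toy documents, not from the paper's collection.
docs = [
    "The court shall decide the case.",
    "The court shall hear the appeal.",
    "We present a new method for the case study.",
]
phrases = frequent_phrases(docs)
features = extended_bag_of_words(docs[0], phrases)
```

The resulting `features` counter holds both single-word counts and phrase counts, so it can be passed to any classifier that accepts sparse count features. Dropping the phrase entries recovers the classic bag-of-words baseline used for comparison.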

Similar resources

Combining classifiers for flexible genre categorization of web pages

With the increase in the number of web pages, it is very difficult to find the wanted information easily and quickly among the thousands of web pages retrieved by a search engine. To solve this problem, many researchers propose classifying documents according to their genre, which is another criterion for classifying documents, different from topic. Most of these works assign a document to only one genre...

Content-free Document Genre Classification using First Order Random Graphs

We approach the general problem of machine-printed document genre classification using content-free layout structure analysis. Document genre is determined from the layout structure detected from scanned binary images of the document pages, using no OCR results and minimal a priori knowledge of document logical structures. Our approach uses attributed relational graphs (ARGs) to represent the lay...

Genres of Digital Documents: Introduction to the Special Issue

Purpose – To introduce the special issue on “Genres of digital documents.” While there are many definitions of genre, most include consideration of the intended communicative purpose, form and sometimes expected content of a document. Most also include the notion of social acceptance, that a document is of a particular genre to the extent that it is recognized as such within a given discourse c...

Genres of Digital Documents Introduction to the Special

Purpose: This article introduces the Special Issue on Genres of Digital Documents. While there are many definitions of genre, most include consideration of the intended communicative purpose, form and sometimes expected content of a document. Most also include the notion of social acceptance, that a document is of a particular genre to the extent that it is recognized as such within a given dis...

Fine-Grained Document Genre Classification Using First Order Random Graphs

We approach the general problem of classifying machine-printed documents into genres. Layout is a critical factor in recognizing fine-grained genres, as document content features are similar. Document genre is determined from the layout structure detected from scanned binary images of the document pages, using no OCR results and minimal a priori knowledge of document logical structures. Our met...

Publication date: 2008